My Iranian churn dataset was missing, so I decided to choose the Seoul Bike Sharing Demand dataset found on the UC Irvine Machine Learning Repository.
Abstract
The dataset contains the count of public bikes rented each hour in the Seoul Bike Sharing System, with the corresponding weather data and holiday information.
Data Set Information
Currently, rental bikes are introduced in many urban cities to enhance mobility comfort. It is important to make rental bikes available and accessible to the public at the right time, as this lessens waiting time. Eventually, providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is predicting the bike count required at each hour to ensure a stable supply. The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.
The objective of this notebook is to analyze the dataset I was assigned and, from there, to:
We can import here all the libraries we need for the notebook.
First, I import the dataset and display a few rows to see what the features are and their format.
We can clearly see that our dataframe deserves some small changes to be more suitable to work with.
Then, as we have only one categorical variable, season (for now), we can display a short description of the numerical features. Here we have:
We will go deeper in these descriptions during the analysis.
Let's start our analysis by dealing with missing values. To be sure the dataset is clean, I compute the number of missing values for each column over the entire dataset.
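As a minimal sketch of that check (with toy values and a hypothetical `temperature` column, not the real headers), the per-column count can be obtained with pandas:

```python
import pandas as pd

# Toy frame; the real dataset has its own columns (this is only an illustration)
df = pd.DataFrame({"n_bike": [254, 204, 173], "temperature": [-5.2, -5.5, None]})

# Number of missing values for each column over the entire dataset
missing = df.isna().sum()
```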
Surprisingly, there are no missing values (a rare event). The dataset is very clean and homogeneous.
Before diving into the analysis, let's explore the number of rented bikes per hour according to whether or not it is a functioning day.
We can see that the number of rented bikes is 0 for hours on non-functioning days. In other words, we cannot train a model on the features to predict the number of rented bikes per hour on non-functioning days. So I remove these samples from the dataset.
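A sketch of that filtering step, assuming a hypothetical `functioning` column holding "Yes"/"No" values:

```python
import pandas as pd

df = pd.DataFrame({
    "n_bike": [0, 120, 0, 85],
    "functioning": ["No", "Yes", "No", "Yes"],  # hypothetical column name
})

# Drop non-functioning days: the target is always 0 there, so they carry no signal
df = df[df["functioning"] == "Yes"].reset_index(drop=True)
```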
As I said before, the target will be the number of bikes rented. More precisely, each sample gives the number of bikes rented for each hour from 2017-12-01 to 2018-11-30. We can conclude it is a regression task: predict the number of bikes rented for a given hour based on the temporal and meteorological features the dataset contains.
For these hours, I want to see the distribution of the number of rented bikes.
TRANSFORMATION REMINDER >
Square Root
The square root method is typically used when your data is moderately skewed. The square root is a transformation that has a moderate effect on distribution shape. It is generally used to reduce right-skewed data. Finally, the square root can be applied to zero values and is most commonly used on count data.
Logarithmic
The logarithm is a strong transformation that has a major effect on distribution shape. This technique is, like the square root method, often used for reducing right skewness. However, it cannot be applied to zero or negative values.
Here we can see the distribution is right-skewed. A square root or log transformation is needed to normalize this distribution. I will choose a square root transformation because I don't want too strong an effect on the values from approximately 500 to 1200, where we see two Gaussian-like sub-distributions.
(For the example, I will plot both the log- and sqrt-transformed distributions.)
We can conclude that a sqrt is more suitable for the n_bike distribution to obtain a more Gaussian-like shape. Moreover, we can clearly see that the two Gaussian-like sub-distributions I spoke about before lead to a left-skewed distribution with a log transform.
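The chosen transformation can be sketched as follows (toy values; `n_bike_sqrt` is a hypothetical name for the transformed column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"n_bike": [0, 100, 400, 900]})

# sqrt is defined at 0 and only moderately compresses the right tail,
# unlike log, which would need a shift and acts much more strongly
df["n_bike_sqrt"] = np.sqrt(df["n_bike"])
```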
The second interesting point is that this is a kind of time-series dataset, and the temptation to treat it that way is high. But as this dataset is about bike rental over time, I am not convinced this is a time-series question. Let's plot the number of bikes rented for each hour over time.
As I said, even looking at this plot suggests the idea of a time series. But if we look at the plot more closely and with knowledge of the data, we can see that the number of rented bikes is not really a time-series problem.
With the knowledge of what a bike rental is, we can reasonably suppose the number of rentals is not determined by the previous rentals (even if a trend can be seen, only one year of data is not sufficient to extrapolate it). The number of rentals is driven by weekday habits and the seasons; that is why our meteorological features will be helpful.
First, we can extract more information from the date and hour to confirm my supposition. I extract:
My objective in the following analysis is to show whether the time features are good categorical features (and not numerical ones for a time series).
First, I have a doubt about whether the day of the current month gives information on the number of rented bikes.
To be honest, I was hoping there would be a trend of renting more bikes at the beginning of the month because the salary is recent. But no. Being at the beginning, middle or end of the month seems to have almost zero influence on our target.
Also, the year is not usable, as the time range is 2017-12-01 to 2018-11-30, exactly a one-year window.
And what about the months and seasons?
In this part, I am interested in whether the month (categorical) is a good feature to predict the number of bikes rented for a given hour on a given day. To do so, I display some aggregates for each month and plot the notched boxplot of the distribution of the relative number of bikes rented.
Here we clearly see that the month has an impact on the number of bikes rented. Moreover, we can suppose the season will be a good summary of this plot. Indeed, we see fewer bikes rented during the winter months, and inversely.
In this part, I will show the same things I did for the month, to see whether the seasons are good features and a good summary of the month results above.
As I said, fewer bikes are rented during the winter months.
In this part, I will do the same thing as for the month and season features, but on the holiday one.
It is not as clear as for the seasons or the months, but there is a trend of fewer bikes rented during holidays. We can almost conclude these bikes are rented to commute to work.
If my previous conclusion is valid, then by plotting the mean number of rented bikes according to the hour, we should see two peaks around 8h00 and 18h00.
As I supposed, we can see these peaks. So I can make a better supposition: this distribution should be even more pronounced for working days (Monday to Friday in Korea), and we can probably see another distribution for weekend days.
For this part, I will plot the distribution of the number of rented bikes according to the hour, splitting the curves:
Bingo! The distribution differs between:
To conclude: I was right in saying the time-series idea is not as good as we might suppose at first. The months, seasons, holidays, hours, etc. as categorical features are far better than a continuous time feature.
In this part I will explain the other features. Only the meteorological ones remain.
The dew point (°C) or dew temperature (°C) is the temperature below which condensation naturally occurs. More technically, below this temperature which depends on the ambient pressure and humidity, the water vapor contained in the air condenses on surfaces, by saturation effect.
Concretely, the higher the dew point and the temperature, the more comfortable the bike rider will feel (assuming typical Korean temperatures).
I will deal with these two features simultaneously, as we can assume these temperatures are correlated. Of course, I will first draw a scatter plot to validate this assumption. Then I will plot the temperature distributions to see if they need some transformation. The last plot shows the relation between the temperatures and the number of bikes rented (already sqrt-transformed).
The two temperatures seem correlated, so it is a good point for me: I can compare them. Moreover, their distributions are near Gaussian, so I leave these two features without transformation. Finally, we can see the concrete argument I made before: the higher the dew point and the temperature, the more comfortable the bike rider feels. It leads to more bikes rented.
Relative air humidity (%) is a measure of the ratio between the water vapor content of the air and its maximum capacity to contain water vapor under these conditions.
The relative humidity of the ambient air influences the evaporation of sweat, and thus the cooling of the body. Too low a humidity level increases cooling and increases the efficiency of perspiration, while too high a humidity level limits cooling and thus increases the feeling of warmth.
I will plot the humidity distribution to see if it needs some transformation. Then the last plot shows the relation between the humidity and the number of bikes rented (already sqrt-transformed).
The humidity distribution is near Gaussian (except near 100%), so I also leave it without transformation. And we can see the concrete argument I made before: the lower the humidity, the more comfortable the bike rider feels. It leads to more bikes rented.
For the wind speed (m/sec.), nothing special to say except that there is no strong wind, which is considered to start from 14 m/sec. In our dataset the max is only 7.4 m/sec., so I am not sure this feature will have a strong influence.
Yet, let's plot the wind speed distribution to see if it needs some transformation. Then the last plot shows the relation between the wind speed and the number of bikes rented (already sqrt-transformed).
I expected higher wind speed to lead to fewer bikes rented, but it is the opposite. I am surprised by this result, but I can suppose it does not depend on the wind alone but on crossed features, or maybe the wind can be comfortable for the rider to some extent. I also see that the wind speed has a right-skewed distribution, so I will apply a log transform to this feature.
As we know, the solar radiation (W/m2) corresponds to a flow of radiation from the sun. In other words, the more radiation there is, the more we are exposed to the sun.
I plot the solar radiation distribution to see if it needs some transformation. Then the other plot shows the relation between the solar radiation and the number of bikes rented (already sqrt-transformed).
Here three conclusions can be made :
So I will apply a square root transformation for three reasons :
The rainfall (mm) corresponds to the rain fallen during the current hour. The more rain there is, the riskier riding is.
The rainfall is given by the rain fallen in a given hour. So, estimating that rain has an effect on dryness for 2 hours, I will apply a moving average of rainfall over the last 2 hours.
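A sketch of this 2-hour moving average with pandas (toy hourly values in mm):

```python
import pandas as pd

rain = pd.Series([0.0, 4.0, 0.0, 0.0, 2.0])  # mm fallen per hour

# Average of the current and previous hour: rain still wets the road
# during the hour that follows it
rain_ma = rain.rolling(window=2, min_periods=1).mean()
```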
As I did for the other meteorological features, I plot the rainfall distribution to see if it needs some transformation. Then the other plot shows the relation between the rainfall and the number of bikes rented (already sqrt-transformed).
The plot showing the regression line clearly indicates that the rainier it is, the less the bikes are used. But I can't clearly see the distribution when there is rain. Let's filter our plots on rainy hours.
Here we can see that the rainy hours are very strongly right-skewed. Moreover, we clearly see the inverse proportionality with the number of rented bikes. So we will try another feature, dryness (1/mm), given by $$dryness = \frac{1}{rain + 1}$$ avoiding a ZeroDivisionError.
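A sketch of this feature (toy values):

```python
import pandas as pd

rain = pd.Series([0.0, 1.0, 9.0])  # mm, after the moving average

# dryness = 1 / (rain + 1): equals 1 for a dry hour and decreases toward 0
# as rain grows; the +1 avoids dividing by zero
dryness = 1.0 / (rain + 1.0)
```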
It is a very interpretable transformation. I will keep it as:
The snowfall (cm) corresponds to the snow fallen during the current hour. The more snow there is, the riskier riding is.
The snowfall is given by the snow fallen in a given hour. So, estimating that snow has an effect on dryness for 8 hours in the city of Seoul, I will apply a moving average of snowfall over the last 8 hours.
I plot the snowfall distribution to see if it needs some transformation. Then the other plot shows the relation between the snowfall and the number of bikes rented (already sqrt-transformed).
To be honest, I thought the rainfall and the snowfall would behave the same. Of course, at first glance, we can conclude that the more it snows, the fewer bikes are rented. Yet, looking more closely, we see it is more of a categorical variable: whether there is 1 or 10 mm of snow, the number of bikes rented seems to follow the same distribution. In other words, snow is disqualifying. So I choose to transform this variable into a boolean one and see the results on a notched box plot.
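A sketch of this boolean encoding (toy values; the snowing feature as named later in this notebook):

```python
import pandas as pd

snow = pd.Series([0.0, 0.4, 3.2, 0.0])  # snowfall after the moving average

# Any snow at all is disqualifying, so only the presence matters, not the amount
snowing = (snow > 0).astype(int)
```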
Indeed, we can see the distribution of the number of bikes rented is restricted. So the snowing feature is a great categorical feature for snowy hours.
Only the visibility (10m) feature remains, which corresponds to the visibility at the horizon for the human eye (default 2000).
I plot the visibility distribution to see if it needs some transformation. Then the other plot shows the relation between the visibility and the number of bikes rented (already sqrt-transformed).
As we can see, the visb is at 2000 by default, which is weird to interpret as a maximum visibility. So I decide to change the visibility feature to invisb, which reverses the distribution. It will be interpreted as the removed (reduced) visibility (10m).
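A sketch of this inversion, using the visb/invisb names above (toy values):

```python
import pandas as pd

visb = pd.Series([2000, 1500, 300])  # visibility in units of 10 m, capped at 2000

# invisb = reduced visibility: 0 means perfect visibility, larger means worse
invisb = 2000 - visb
```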
Now it is easier to interpret. The more the visibility is reduced, the less the bikes are rented.
Before completing the preprocessing of the dataset, I switch the variable containing this data from df to data, and we can have a quick look at this data as a reminder.
Let's remember that holiday, working_day and snowing are already one-hot encoded, as they are boolean features. There remain only:
Important note: I save all the dummies in a dummies.json file.
Now we have all our categorical features ready to use. Let's plot the histogram and describe the numerical ones as a reminder.
The humidity, and the dryness are in a given interval ([0, 100] and [0, 1]) so I will normalize these features. I will also normalize the solar feature as I want to keep the 0 value the same.
There remain the number of bikes, the temperature, the dew point and the wind, which I will standardize. A particular point for n_bike: I put the mean and std in variables to un-fit predictions later. And I save the transformations applied in a transformations.json file.
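A sketch of the target standardization and of how the stored statistics invert the whole pipeline (sqrt then standardize), with toy values:

```python
import pandas as pd

n_bike_sqrt = pd.Series([10.0, 20.0, 30.0])  # target after the sqrt transform

# Keep mean and std so predictions can be un-fitted later
mean, std = n_bike_sqrt.mean(), n_bike_sqrt.std()
target = (n_bike_sqrt - mean) / std

# Un-standardize, then square to undo the sqrt: back to a bike count
recovered = (target * std + mean) ** 2
```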
To get an idea of how important the preprocessing is, I think it is fun to create a train-test split of the dataset without preprocessing. Of course, to be honest, I will at least one-hot encode the categorical day variables, remove the non-functioning days, and timestamp the date so it can be considered as a time series.
As I already said, we have a regression problem, so I need to test some regression models. To do so, let's display the models sklearn gives us.
There are A LOT of models. I decide to choose some of them:
IMPORTANT: For each model (except the Baseline, of course), the cells will be exactly the same except for the first one, where the grid-search hyperparameters are defined. This leads to a homogeneous way of comparing these models. So read the code once.
Before diving into model grid searching and fitting, let's talk about my objectives. For each model:
To compute the RMSE, I need a custom method that returns the loss value and un-fits the transformed target if needed.
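A sketch of such a method, assuming the target may have been standardized and sqrt-transformed (the parameter names here are hypothetical):

```python
import numpy as np

def rmse(y_true, y_pred, mean=None, std=None, squared_target=False):
    """RMSE, optionally undoing standardization and the sqrt transform first."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if mean is not None and std is not None:
        y_true, y_pred = y_true * std + mean, y_pred * std + mean
    if squared_target:
        # Undo the sqrt transform so errors are measured in bike counts
        y_true, y_pred = y_true ** 2, y_pred ** 2
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```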
Then I need a method to compute the relative error distribution, to get an idea of the error made by the models. (I try this method with random fake data.)
And I also need a method to plot the same relative error according to the target values (also tested on the fake random data).
Last but not least, I need a method to perform the grid search homogeneously, to be able to honestly compare the following models. It performs a cross-validation shuffling the training set into 4 parts. Also, a random seed is set to make my experiments reproducible.
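A sketch of such a helper with scikit-learn (the tiny synthetic data is only there to exercise it; function and variable names are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold

def run_grid_search(model, param_grid, X, y, seed=42):
    """Grid search with a 4-fold shuffled CV and a fixed seed for reproducibility."""
    cv = KFold(n_splits=4, shuffle=True, random_state=seed)
    search = GridSearchCV(model, param_grid, cv=cv,
                          scoring="neg_root_mean_squared_error")
    search.fit(X, y)
    return search.best_estimator_, search.best_params_

# Tiny synthetic regression problem to exercise the helper
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = X @ np.array([2.0, -1.0]) + 0.5
best, params = run_grid_search(LinearRegression(),
                               {"fit_intercept": [True, False]}, X, y)
```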
My first model is not a model from sklearn. It is always a good idea to have a baseline of the error without any model. In other words, I take the mean target for the two datasets and compute the error on the respective test sets.
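A sketch of that baseline (toy values):

```python
import numpy as np

y_train = np.array([100.0, 200.0, 300.0])
y_test = np.array([150.0, 250.0])

# Predict the training mean for every test sample, then score it
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_rmse = float(np.sqrt(np.mean((y_test - baseline_pred) ** 2)))
```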
We can see that our preprocessing of the target (sqrt) gives a poorer baseline. Nothing to worry about; on the contrary, it is logical to have a worse prediction, as the mean is driven by the extremes. So with our sqrt transform, it leads to underestimating the high extremes.
Of course, we can see the predictions with the Baseline model are better on the dataset without transformation. Moreover, we know the Baseline model on the transformed dataset underestimates the number of bikes rented.
Model Name: LinearRegression
Description: Ordinary Least Squares algorithm
Prevents Overfitting: no
Handles Outliers: no
Handles several features: no
Adaptive Regularization: no
Large Dataset: no
Non linear: no
Interpretability Score: 5 / 5
When to Use: Highly interpretable, no introduced bias
When to Use Expanded:
$\qquad$- Data consists of few outliers
$\qquad$- Little variance between output labels
$\qquad$- All of the input features are not only independent but also are not correlated.
Advantages:
$\qquad$- Easy to interpret results
$\qquad$- Low complexity level
Disadvantages:
$\qquad$- At risk of multicolinearity if input features are correlated
$\qquad$- Small errors/outliers in target values can drastically impact model
Sklearn Package: linear
Required Args: None
Helpful Args: None
Variations: None
Grid Search:
fit_intercept: bool, default=True
Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).
normalize: bool, default=False
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use sklearn.preprocessing.StandardScaler before calling fit on an estimator with normalize=False.
Here we have our first proof of the transformation's efficiency. For this simple linear model, we have a clearly better prediction with the transformed dataset. It shows the new features are individually more meaningful than the raw ones. So we can try the ElasticNet, which can be useful when there is some correlation between input features and to avoid overfitting.
Model Name: ElasticNet
Description: Ordinary Least Squares with both an L1 and L2 regularization term. The weights of the L1 vs. L2 regularization terms are controlled by an l1_ratio parameter.
Prevents Overfitting: yes
Handles Outliers: no
Handles several features: yes
Adaptive Regularization: no
Large Dataset: no
Non linear: no
Interpretability Score: 3 / 5
When to Use: Blend Ridge and Lasso
When to Use Expanded:
$\qquad$- Data consists of few outliers
$\qquad$- May be some correlation between input features
$\qquad$- Avoid overfitting
$\qquad$- Feature selection
Advantages:
$\qquad$- Incorporate the feature selection abilities of Lasso with the regularization abilities of Ridge.
Disadvantages:
$\qquad$- In lowering variance, incorporates a degree of bias into the model.
$\qquad$- Can be difficult to tune alpha to attain a desirable balance between OLS and regularization terms
$\qquad$- Higher computational cost than Ridge or Lasso
Sklearn Package: linear
Required Args: None
Helpful Args: alpha (controls strength of regularization terms), l1_ratio (controls ratio between L1 and L2 regularization terms)
Variations: ElasticNetCV, MultiTaskElasticNet
Grid Search:
alpha: float, default=1.0
Constant that multiplies the penalty terms. Defaults to 1.0. See the notes for the exact mathematical meaning of this parameter. alpha = 0 is equivalent to an ordinary least squares, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised. Given this, you should use the LinearRegression object.
l1_ratio: float, default=0.5
The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
In this case, the model on the untransformed dataset is better. It is because the data contains fewer outliers compared to the transformed dataset, due to the high number of features. I am tempted to try the HuberRegressor, which is good for quick analyses of data ignoring outliers, even if it can break down with a large number of input features.
Model Name: HuberRegressor
Description: A linear model designed to deal with outliers in the data and/or corrupted data. Does not ignore the outliers, but rather gives them a lower weight.
Prevents Overfitting: no
Handles Outliers: yes
Handles several features: no
Adaptive Regularization: no
Large Dataset: no
Non linear: no
Interpretability Score: 4 / 5
When to Use: Outliers and want quickest algorithm
When to Use Expanded:
$\qquad$- Want quick analyses of data ignoring outliers
Advantages:
$\qquad$- Faster than RANSAC and TheilSen (as long as the number of samples is not too large)
$\qquad$- Does not completely ignore data points it deems as outliers
Disadvantages:
$\qquad$- Break down with large numbers of input features
Sklearn Package: linear
Required Args: None
Helpful Args: max_iter (maximum number of iterations to perform)
Variations: None
Grid Search:
epsilon: float, greater than 1.0, default=1.35
The parameter epsilon controls the number of samples that should be classified as outliers. The smaller the epsilon, the more robust it is to outliers.
Interestingly, the HuberRegressor does not work on the untransformed dataset. It seems to underfit the data by completely ignoring the outliers. I have to make a compromise between ElasticNet and HuberRegressor, so I will choose to try the BayesianRidge.
Model Name: BayesianRidge
Description: Similar to Ridge but the regularization parameter is tuned to fit the data during the training process.
Prevents Overfitting: yes
Handles Outliers: no
Handles several features: no
Adaptive Regularization: yes
Large Dataset: no
Non linear: no
Interpretability Score: 2 / 5
When to Use: Ridge but don't want to set regularization constant
When to Use Expanded:
$\qquad$- Are seeking results similar to Ridge, but willing to sacrifice interpretability for time saved not having to test different regularization weights
Advantages:
$\qquad$- No need to tune alpha value
$\qquad$- Adapts well to data on hand
Disadvantages:
$\qquad$- Less interpretable results
Sklearn Package: linear
Required Args: None
Helpful Args: None
Variations: None
Grid Search:
n_iter: int, default=300
Maximum number of iterations. Should be greater than or equal to 1.
Here we are! The BayesianRidge shows us one thing. It can be hard to be sure of it, but let's summarize my thoughts:
So I think there is no regularization needed, as the LinearRegression is the best model for now and the other models tried are variants with regularization. I am not sure that trying another variant would lead to the same conclusion. Let's go for the ARDRegression.
Model Name: ARDRegression
Description: BayesianRidge with sparser weight values. Almost like a version of BayesianLasso.
Prevents Overfitting: yes
Handles Outliers: no
Handles several features: yes
Adaptive Regularization: yes
Large Dataset: no
Non linear: no
Interpretability Score: 2 / 5
When to Use: Lasso but don't want to set regularization constant
When to Use Expanded:
$\qquad$- Are seeking results similar to Lasso, but willing to sacrifice interpretability for time saved not having to test different regularization term weights
Advantages:
$\qquad$- No need to tune alpha value
$\qquad$- Adapts well to data on hand
$\qquad$- Reduces weight of unimportant features
Disadvantages:
$\qquad$- Less interpretable results
$\qquad$- Computationally expensive (can't handle very large datasets)
Sklearn Package: linear
Required Args: None
Helpful Args: None
Variations: None
Grid Search:
n_iter: int, default=300
Maximum number of iterations.
And yes, I obtain the same conclusion as for the BayesianRidge. It is time to stop trying linear models and move on to other regressors. I will try the KNeighborsRegressor.
Model Name: KNeighborsRegressor
Description: Creates a model based on the k nearest neighbors at any given point, where k is an input argument.
Prevents Overfitting: no
Handles Outliers: no
Handles several features: no
Adaptive Regularization: no
Large Dataset: no
Non linear: yes
Interpretability Score: 5 / 5
When to Use: Nonlinear data, interpretability is important, unimportant features
When to Use Expanded:
$\qquad$- When you are unsure of the structure of your data and want a model that will fit well
$\qquad$- Not concerned with overfitting
$\qquad$- Interpretability is important
Advantages:
$\qquad$- Fits very well to data of various structures
$\qquad$- More interpretable than other nonlinear models
Disadvantages:
$\qquad$- Extremely impacted by outliers and corrupt data
$\qquad$- Need several more samples than features for quality results
$\qquad$- Difficulty dealing with large numbers of features
Sklearn Package: neighbors
Required Args: None
Helpful Args: n_neighbors (number of neighbors to use)
Variations: None
Grid Search:
n_neighbors: int, default=5
Number of neighbors to use by default for kneighbors queries.
weights: str or callable, default=’uniform’
Weight function used in prediction. Possible values: ‘uniform’: uniform weights, all points in each neighborhood are weighted equally; ‘distance’: weight points by the inverse of their distance, in which case closer neighbors of a query point have a greater influence than neighbors which are further away.
p: int, default=2
Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
We have really poor results on the transformed dataset. It can be easily explained:
A RandomForestRegressor can deal with these disadvantages.
Model Name: RandomForestRegressor
Description: A random forest is a meta estimator that fits a number of classifying decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Prevents Overfitting: yes
Handles Outliers: yes
Handles several features: yes
Adaptive Regularization: no
Large Dataset: yes
Non linear: yes
Interpretability Score: 3 / 5
When to Use: Nonlinear data groups in buckets
When to Use Expanded:
$\qquad$- Data is not linear and is composed more of "buckets"
$\qquad$- Number of samples > number of features
$\qquad$- There are dependent features in the input data. DTR handles these correlations well.
Advantages:
$\qquad$- Can export tree structure to see which features the tree is splitting on
$\qquad$- Handles sparse and correlated data well
$\qquad$- Able to tune the model to help with overfitting problem
Disadvantages:
$\qquad$- Prediction accuracy on complex problems is usually inferior to gradient-boosted trees.
$\qquad$- A forest is less interpretable than a single decision tree.
Sklearn Package: ensemble
Required Args: None
Helpful Args: criterion and max_depth
Variations: gradient-boosted trees
Grid Search:
n_estimators: int, default=100
The number of trees in the forest.
criterion: str, default=”mse”
The function to measure the quality of a split. Supported criteria are “mse” for the mean squared error, which is equal to variance reduction as feature selection criterion, and “mae” for the mean absolute error.
max_depth: int, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
max_features: str, int or float, default=”auto”
The number of features to consider when looking for the best split.
The RandomForestRegressor is the best model I have tried. It reaches the best results for both datasets, transformed and untransformed, and is a bit better on the transformed one. I will stop the model enumeration here and compare the final results.
As I ran the same cells to have a homogeneous way to compare the models, the model comparison will be really easy. I only group the results by dataset and sort the RMSE values in ascending order according to the transformed dataset.
Here we reach the same conclusions: the KNeighborsRegressor is a poor model to predict the number of rented bikes; the linear models are better, preferably without any regularization; and the RandomForestRegressor is clearly the best model to choose for the two datasets, even if it works better on the transformed one.
It can be interesting to see which features are the most important in our case for predicting the number of bikes rented for a given hour. Of course, as we have 89 features, I will limit the bar plot to the top 30 features, considering the other ones are not important to our result.
Not surprisingly, the meteorological features are globally the most important ones. The working hours follow, with 18h for instance.
We can conclude from this plot:
- There are more bikes rented when the weather is good and the temperature is high
- There are more bikes rented during peak hours
I have chosen my model, so I need to create the methods for the model to be deployed.
First, I save my RandomForestRegressor model.
On the client side, the user will be asked to create a matrix of features to predict. Each row is a list of ordered features as follows:
| Feature | Format |
|---|---|
| Date | dd/mm/yyyy |
| Hour | int |
| Temperature | °C |
| Humidity | % |
| Wind speed | m/s |
| Visibility | 10m |
| Dew point temperature | °C |
| Solar Radiation | MJ/m2 |
| Rainfall | mm |
| Snowfall | cm |
| Seasons | {"Winter", "Autumn", "Spring", "Summer"} |
| Holiday | {"Holiday", "No Holiday"} |
I recreate the preprocessing function:
To test whether my functions are well implemented, I take the raw values from the csv file and perform the preprocessing on this matrix.
Then I compare the X I had in part 5.4 Train-test split with the matrix of values from the raw data, for:
Columns are in the same order and are the same. All the features are also the same, except for the dryness, but that can be caused by a floating-point approximation. Everything is clean.
Then remains the server side, where the model is deployed. From the client side, one will be asked to request the API (hosted on localhost for now) on the '/predict' route with the json matrix of features to predict.
Here is the Python code from the deployed_model.py file:
# Import libraries
import json
import pickle

import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the model and the saved target transformation statistics
model = pickle.load(open('rf_model.pkl', 'rb'))
with open('transformations.json') as f:
    transformations = json.load(f)
mean, std = transformations['n_bike']['mean'], transformations['n_bike']['std']

@app.route('/predict', methods=['POST'])
def predict():
    # Get the data from the POST request.
    data = request.get_json(force=True)
    # Make a prediction using the model loaded from disk.
    inputs = np.array(data['inputs'])
    # Un-standardize, then square to undo the sqrt transform on the target
    prediction = (model.predict(inputs) * std + mean) ** 2
    # Cast numpy floats to plain floats so they are JSON-serializable
    return jsonify([float(p) for p in prediction])

if __name__ == '__main__':
    app.run(port=5000, debug=True)
We can run this server from the terminal with python deployed_model.py or python3 deployed_model.py, depending on your OS.
I can first check that the request runs without any issue, as a sanity check.
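A sketch of such a sanity-check request with the standard library (the feature row length here is illustrative, not the real 89 columns; the commented line needs the server running):

```python
import json
import urllib.request

row = [0.12, -0.5, 1.0, 0.0]  # illustrative preprocessed feature row
payload = json.dumps({"inputs": [row]}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:5000/predict",
    data=payload,
    headers={"Content-Type": "application/json"},
)
# predictions = json.loads(urllib.request.urlopen(req).read())
```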